Hearing by eye: visual spatial degradation and the McGurk effect

Authors

  • John MacDonald
  • Soren Andersen
  • Talis Bachmann
Abstract

McGurk and MacDonald (Nature 264, 746-748, 1976) discovered that when a discrepancy is created between visual information from lip movements and speech information from the auditory channel, perceivers often report a percept that is neither the auditory nor the visual stimulus: an illusory response. This 'McGurk' effect is strong evidence that perceivers extract key information about a speech sound from the concomitant visual articulation. This study investigates the effects of spatial quantisation on the McGurk effect. Participants (N = 20) were presented with incongruent auditory-visual combinations of simple consonant-vowel tokens. The visual stimulus was either intact or had undergone various degrees of degradation through spatial quantisation. McGurk-type responses were significantly influenced by the level of quantisation, with more veridical auditory responses at the coarser levels of quantisation. However, even at the coarsest level of quantisation some McGurk-type responses were reported.

INTRODUCTION

The issue of how different modalities interact with one another to produce a unified and integrated perceptual environment is seen most clearly in the area of audio-visual speech perception [4, 11]. A number of studies have demonstrated greater intelligibility of speech when lip-movement information is available [18]. However, in addition to facilitation, there is also evidence for more integrated processing. The most convincing demonstration of this is the McGurk effect [14, 10]. When participants are presented with conflicting information from the two modalities, e.g. visual /ga/ with auditory /ba/, what is heard is either an illusory fusion, i.e. a heard /da/, or a combination, e.g. /bga/, where the lip movements are for /ba/ and the auditory stimulus is /ga/. This is a robust finding which has been confirmed by a number of studies [16, 8, 12, 21, 15]. Although this interactive influence of visible lip movements on auditory speech perception is well established, it is not clear what mechanisms and processes govern the resultant perception.

Research has established that the most informative areas of the face are those surrounding the lips, including the jaw and cheekbone [17, 3]. Eye-movement studies have also shown that although most fixations of gaze are to the eyes and mouth, the proportion of gaze duration fixated on the mouth increases as masking noise levels increase [20]. A technique that has recently been used to study what information is detected from faces, and how it is used to make perceptual judgements, is spatial quantisation [1, 7, 5, 2, 19]. To date, only one study has used this technique to investigate visible speech perception [6]; it found that speechreading performance was relatively resistant to the effects of image degradation, although the effect varied across different speech tokens. The rationale behind the technique is to introduce different levels of spatial degradation systematically, so that local feature analysis is impaired relatively more than the more global configuration analysis; it should then be possible to establish the relative impact of these different levels on the perceptual process under investigation. The present study applies this approach to audio-visual speech perception, using the McGurk effect elicited by conflicting auditory-visual stimuli as the key measure.
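The quantisation manipulation at the heart of this approach is an area-averaging "mosaic" transform: the image is divided into square blocks and every pixel within a block is replaced by that block's mean intensity, so fine local detail is lost while the coarse configuration survives. The sketch below is only an illustration of the idea on a NumPy array; the function name and block-size parameter are hypothetical, and the study itself produced the quantised stimuli with dedicated video hardware (see Method), not software.

```python
# Illustrative sketch of area-averaging spatial quantisation ("mosaic" transform).
# Hypothetical helper for illustration only; not the hardware procedure used in the study.
import numpy as np

def quantise(image: np.ndarray, block_size: int) -> np.ndarray:
    """Replace each block_size x block_size block with its mean intensity."""
    h, w = image.shape[:2]
    out = image.astype(float)
    for y in range(0, h, block_size):
        for x in range(0, w, block_size):
            block = out[y:y + block_size, x:x + block_size]
            block[...] = block.mean(axis=(0, 1))   # area average within the block
    return out.astype(image.dtype)

# Coarser quantisation means larger blocks and therefore fewer "pixels per face":
# e.g. a face spanning 280 image pixels divided into 28-pixel blocks yields ~10 pixels/face.
```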
It is hypothesised that, since the McGurk effect depends on the processing and influence of the visual stimulus, and since spatial degradation reduces the information derived from the visual component, the relative frequency of illusory responses should decrease with increasing levels of degradation. There are two main aims: (1) to measure the effects of quantisation in order to establish the tolerance of the McGurk effect to degradation of the visual stimulus, and (2) to determine whether the influence of the facial information declines gradually with coarseness of quantisation, or whether there is a critical point at which information is lost, resulting in a quantum decrease of the illusion. Previous research on face identification has found that, with a gradual increase in the coarseness of quantisation, i.e. an increase in the size of the pixels, there is a dramatic drop in face recognition at a certain critical value of quantisation as measured by the number of pixels/face [1, 5].

METHOD

Participants
Twenty native speakers of British English, 10 males and 10 females, aged between 18 and 40, with normal or corrected-to-normal vision and normal hearing, took part in a 2-hour session.

Stimuli
The face of a young woman was videotaped uttering pairs of consonant-vowel (CV) syllables. Eight pairs of CV syllables were used: /ba-ba/, /da-da/, /ga-ga/, /pa-pa/, /ta-ta/, /ka-ka/, /ma-ma/ and /na-na/. Each CV syllable pair was spoken three times with an interval of approximately 1 second between repetitions, thus producing a triple of pairs, e.g. /ba-ba/, /ba-ba/, /ba-ba/. To ensure visibility of the oral cavity and to eliminate shadows on the face, spotlights were placed at an angle of 45°, 2 metres to the left and right of the speaker. For each recording the speaker's face was positioned at the centre of the camera frame. She was instructed to articulate the syllables naturally, to avoid artificial emphasis, to close her mouth between repetitions, and to sit as still as possible. Three recordings of each CV syllable triple resulted in a total of 24 master stimuli. From these master stimuli, congruent and incongruent AV speech examples were created within the following consonant groups: [b d g] (voiced), [p t k] (voiceless) and [m n] (nasal). This procedure gave a total of 22 AV speech combinations: 14 incongruent and 8 congruent. The stimuli were produced by dubbing the selected auditory stimulus onto the selected visual stimulus. To ensure coincidence of the auditory and the visual signal during the release of the consonant in each utterance, single-frame editing was used.

Degrading the stimuli
The spatial quantisation (area-averaging mosaic transformation [9]) of the AV stimuli was performed using a Panasonic WJ-MX30 mixing/special-effects board and two Panasonic AG7350 video recorders. On the basis of pilot observations and previous speaking-face recognition research with quantised stimuli [6], five levels of spatially quantised AV stimuli were prepared: level 0 (the original high-resolution image at the TV monitor's level of resolution), level 3 (29.2 pixels/face, 11.6 pixels/mouth), level 5 (19.4 p/face, 7.7 p/mouth), level 7 (14.2 p/face, 5.6 p/mouth) and level 9 (11.2 p/face, 4.4 p/mouth). The five levels of spatial quantisation were crossed with the 22 AV speech stimuli, yielding a total of 110 spatially quantised AV stimuli.
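As a check on the design arithmetic just described, the short sketch below enumerates the within-group dubbings and confirms the counts: 8 congruent and 14 incongruent AV combinations, and 22 × 5 = 110 quantised stimuli. The group lists and level labels are taken from the text; everything else (names, structure) is hypothetical and for illustration only.

```python
# Sketch of the stimulus design described above: AV dubbings are formed only
# within each consonant group, then crossed with the five quantisation levels.
from itertools import product

groups = [["ba", "da", "ga"],   # voiced
          ["pa", "ta", "ka"],   # voiceless
          ["ma", "na"]]         # nasal
levels = [0, 3, 5, 7, 9]        # spatial quantisation levels

av_combinations = [(visual, auditory)
                   for group in groups
                   for visual, auditory in product(group, group)]

congruent = [(v, a) for v, a in av_combinations if v == a]
incongruent = [(v, a) for v, a in av_combinations if v != a]

print(len(congruent), len(incongruent))       # 8 14  -> 22 AV combinations
print(len(av_combinations) * len(levels))     # 110 spatially quantised AV stimuli
```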
Blocking the stimuli
Each trial consisted of 5 seconds of the face prior to articulation and 1 second of the face after articulation, with a 10-second inter-trial interval. Two pseudorandom sequences of the 110 trials were created, avoiding sequential repetitions of AV stimuli.

Procedure
The AV stimuli were presented on a 20" monitor from an S-VHS Panasonic AG7350 video recorder. Participants sat approximately 100-110 cm from the screen, fixating the central part of the screen at eye level. The vertical size of the face presented on the screen, measured at spatial degradation level 0, was approximately 28 cm. The horizontal width of the face, from cheekbone to cheekbone, was approximately 18.0 cm, and the horizontal width of the mouth when the lips were closed was approximately 7.0 cm. At a viewing distance of 100-110 cm, the face therefore subtended 16 degrees of visual angle vertically and 10.3 degrees horizontally, with the mouth subtending 4.0 degrees horizontally (a worked check of these figures is sketched after Table 1). All testing was conducted individually in a sound-attenuated room. Participants were instructed that they would watch a video of a woman uttering meaningless but intelligible syllables, and that their task was to report what they heard her say. Each participant experienced both sequences of 110 stimuli, i.e. 220 trials in total. Participants received £10 as payment.

RESULTS
The main dependent variable was the number of correct auditory identification responses, calculated for each participant for each quantisation condition and each incongruent audio-visual stimulus combination. The overall results are presented in Table 1: the higher the value, the weaker the McGurk effect and hence the weaker the demonstrated audio-visual interaction. The maximum score was 2.0, representing completely veridical auditory responding on both trials. As can be seen from Table 1, the rate of correct auditory responding was influenced by the auditory-visual stimulus combination (F(9,171) = 10.12, p < 0.001) and by the level of spatial quantisation (F(4,76) = 24.74, p < 0.001). There was also a significant interaction between stimulus pair and quantisation (F(36,684) = 5.54, p < 0.001). In order to explore these effects, a series of subsidiary analyses of variance was carried out. For example, it is clear that for many speakers, and for listeners to those speakers, the phonemes /d/ and /g/ are visually indistinguishable; there should therefore be little difference between responses to the pairing of auditory /ba/ with visual /da/ and the pairing of auditory /ba/ with visual /ga/. This was tested directly by comparing the number of correct responses across these conditions and across levels of quantisation.

Visual /ba/; auditory /da/ or /ga/
There was no significant effect of the auditory stimulus and no significant interaction with level of spatial quantisation. However, there was a significant main effect of spatial quantisation (F(4,76) = 2.90, p < 0.05): participants made fewer errors at coarser levels of quantisation.

Visual /pa/; auditory /ta/ or /ka/
There was no main effect of the auditory stimulus used, but there was a significant main effect of quantisation (F(4,76) = 3.22, p < 0.05) and a significant interaction between auditory stimulus and quantisation (F(4,76) = 4.29, p < 0.005). This latter interaction resulted from …

Table 1. Mean number of correct auditory responses as a function of visual context, auditory stimulus and spatial quantisation level (standard deviations in parentheses, N = 20).
Visual    /ba/   /ba/   /da/   /ga/   /pa/   /pa/   /ta/   /ka/   /ma/   /na/
Auditory  /da/   /ga/   /ba/   /ba/   /ta/   /ka/   /pa/   /pa/   /na/   /ma/
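The visual-angle figures quoted in the Procedure follow from the face dimensions and the viewing distance. The sketch below is only an illustrative check (the function is hypothetical), using the standard relation θ = 2·arctan(size / (2·distance)) at the near end of the 100-110 cm viewing range.

```python
# Worked check of the viewing-geometry figures quoted in the Procedure (sketch only).
import math

def visual_angle_deg(size_cm: float, distance_cm: float) -> float:
    """Visual angle subtended by an object of the given size at the given distance."""
    return math.degrees(2 * math.atan(size_cm / (2 * distance_cm)))

for label, size in [("face height", 28.0), ("face width", 18.0), ("mouth width", 7.0)]:
    print(label, round(visual_angle_deg(size, 100.0), 1), "deg")
# face height ~15.9 deg, face width ~10.3 deg, mouth width ~4.0 deg,
# consistent with the 16, 10.3 and 4.0 degrees reported in the text.
```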




Publication date: 1999